Abstract
Human emotion recognition has become a significant research focus within artificial intelligence due to its growing importance in human–computer interaction, affective computing, and intelligent decision-support systems. Conventional emotion recognition methods have largely relied on unimodal data sources such as text, speech, or facial expressions. Although effective in controlled settings, unimodal approaches often provide an incomplete and ambiguous picture of emotional expression, as human emotions are inherently multimodal. This review paper critically examines a dissertation that proposes a deep learning-based multimodal sentiment analysis framework for human emotion detection by integrating textual, acoustic, and facial expression modalities. The reviewed framework employs a Long Short-Term Memory (LSTM)-based architecture to model the temporal and contextual dependencies present in multimodal data. Textual information is encoded as sequences of word embeddings, acoustic features capture emotional prosody in the audio signal, and visual inputs represent facial expression patterns. These modality-specific features are fused within a unified deep learning framework to perform binary emotion classification. Experimental evaluation using standard performance metrics, including accuracy, precision, recall, F1-score, confusion matrix analysis, and training–validation curves, demonstrates an overall classification accuracy of 82.22 percent with balanced precision and recall. The review highlights the robustness, methodological soundness, and practical relevance of multimodal sentiment analysis, emphasizing its advantages over unimodal approaches and its contribution to the advancement of affective computing research.
Introduction
Emotion recognition is a key area in artificial intelligence, as emotions strongly influence human behavior, communication, and decision-making. Traditional approaches relied on unimodal data such as text, speech, or facial expressions, but these methods often fail to capture the full complexity of emotions due to limitations like ambiguity in text, noise in audio, and variability in facial expressions.
To overcome these challenges, the study adopts a multimodal sentiment analysis approach that integrates text, audio, and facial data. This approach reflects how humans naturally interpret emotions by combining multiple cues. The framework uses deep learning, particularly Long Short-Term Memory (LSTM) networks, to capture temporal and contextual emotional patterns across different modalities.
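The review does not reproduce the dissertation's implementation, but the described design maps naturally onto per-modality LSTM encoders combined by late fusion. The following is a minimal PyTorch sketch under that assumption; the class name, layer sizes, and feature dimensions (300 for word embeddings, 74 for acoustic features, 35 for facial features) are illustrative placeholders, not values taken from the reviewed work.

```python
import torch
import torch.nn as nn

class MultimodalLSTM(nn.Module):
    """Illustrative LSTM fusion model: one encoder per modality,
    with late fusion by concatenating the final hidden states."""
    def __init__(self, text_dim=300, audio_dim=74, visual_dim=35, hidden=64):
        super().__init__()
        self.text_lstm = nn.LSTM(text_dim, hidden, batch_first=True)
        self.audio_lstm = nn.LSTM(audio_dim, hidden, batch_first=True)
        self.visual_lstm = nn.LSTM(visual_dim, hidden, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(3 * hidden, hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, 1),  # single logit for the binary emotion label
        )

    def forward(self, text, audio, visual):
        # Each input has shape (batch, time, features); keep only the
        # final hidden state of each modality-specific encoder.
        _, (h_t, _) = self.text_lstm(text)
        _, (h_a, _) = self.audio_lstm(audio)
        _, (h_v, _) = self.visual_lstm(visual)
        fused = torch.cat([h_t[-1], h_a[-1], h_v[-1]], dim=-1)
        return self.classifier(fused)

model = MultimodalLSTM()
logits = model(torch.randn(8, 20, 300),   # e.g. word-embedding sequences
               torch.randn(8, 20, 74),    # e.g. prosodic/acoustic features
               torch.randn(8, 20, 35))    # e.g. facial-expression features
probs = torch.sigmoid(logits)             # emotion probability per sample
```

Concatenating final hidden states is the simplest late-fusion strategy; the dissertation may well use a different fusion scheme (early fusion, attention, or tensor-based fusion), which this sketch does not claim to reproduce.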
Methodologically, emotion recognition has evolved from lexicon-based methods through classical machine learning to deep learning, which enables automatic feature extraction and better handling of complex data. Multimodal systems further improve accuracy and robustness by combining complementary information and reducing reliance on any single data source.
The proposed model uses a multimodal dataset and processes each modality separately before fusing the modality-specific representations in an LSTM-based architecture for binary emotion classification. Experimental results show strong performance, achieving 82.22% accuracy with balanced precision, recall, and F1-scores, indicating reliable and consistent predictions.
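For context, the reported metrics are the standard ones for binary classification and can be computed along the following lines with scikit-learn; the labels below are placeholders, since the dissertation's predictions are not reproduced in the review.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# y_true: gold binary emotion labels; y_pred: model predictions
# (both hypothetical here -- not the dissertation's actual data).
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1-score :", f1_score(y_true, y_pred))
print("confusion matrix:\n", confusion_matrix(y_true, y_pred))
```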
Overall, the study demonstrates that multimodal emotion recognition is more effective and practical than unimodal approaches, with applications in human–computer interaction, mental health monitoring, and intelligent systems. Limitations include limited dataset diversity, subjective labeling, and the restriction to binary classification, pointing to future improvements such as multi-class models, more advanced architectures, and real-time deployment.
Conclusion
This review paper has synthesized and critically examined a dissertation that proposes a deep learning-based multimodal sentiment analysis framework for human emotion detection. The reviewed work is grounded in the recognition that human emotions are inherently complex, dynamic, and multimodal, and therefore cannot be reliably captured through single-channel analysis alone. By integrating textual, acoustic, and facial expression modalities within a unified Long Short-Term Memory–based deep learning architecture, the dissertation addresses fundamental limitations associated with traditional unimodal emotion recognition approaches. The review highlights how the combined use of multiple modalities enables a more comprehensive and human-like interpretation of emotional states by leveraging complementary emotional cues that are otherwise overlooked in unimodal systems.
A key contribution emphasized in this review is the effective use of LSTM networks to model temporal and contextual dependencies across multimodal data. Emotional expression often evolves over time, particularly in spoken communication, and the ability to capture such temporal dynamics is essential for accurate emotion detection. The reviewed framework demonstrates stable learning behavior, as evidenced by consistent training and validation performance, indicating strong generalization capability and limited overfitting. The reported overall classification accuracy of 82.22 percent, along with balanced precision, recall, and F1-score values, underscores the robustness and practical viability of the proposed approach in handling emotionally ambiguous and noisy real-world data.
The review further illustrates that the observed misclassifications primarily occur near emotional boundaries, where textual, vocal, and facial cues may convey mixed or subtle emotional signals; a simple way to make this notion concrete is sketched below. Such cases reflect the inherent subjectivity of emotion perception rather than deficiencies in the model itself. Importantly, the multimodal integration strategy adopted in the reviewed dissertation helps mitigate these ambiguities by compensating for weaknesses in individual modalities, thereby enhancing classification reliability.
Overall, the reviewed dissertation represents a meaningful and timely contribution to the field of affective computing and multimodal emotion recognition. It provides a structured and scalable framework that balances accuracy, computational efficiency, and methodological clarity. By demonstrating the tangible benefits of multimodal sentiment analysis over unimodal approaches, the work establishes a strong foundation for future research aimed at developing more adaptive, context-aware, and human-centric emotion recognition systems.
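As a purely illustrative note, "proximity to the emotional boundary" can be operationalized as the distance of a model's predicted probability from the 0.5 decision threshold; samples with small margins are the ambiguous cases where modalities are most likely to disagree. The probabilities and the 0.1 margin cutoff below are hypothetical, not taken from the reviewed work.

```python
import numpy as np

# Hypothetical predicted probabilities from a binary emotion model.
probs = np.array([0.93, 0.48, 0.07, 0.55, 0.81, 0.52])

# Distance from the 0.5 decision threshold: small values flag
# ambiguous, "boundary" samples prone to misclassification.
margin = np.abs(probs - 0.5)
ambiguous = probs[margin < 0.1]
print("boundary cases:", ambiguous)   # -> [0.48 0.55 0.52]
```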